{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encrypted Data-frames\n",
    "\n",
    "The following notebook shows how to encrypt Pandas data-frames and run a left join on them using Fully Homomorphic Encryption (FHE) in a client-server setting using Concrete ML. This example is separated into three main sections : \n",
    "1) Two independent clients load their own csv file using Pandas, encrypt their data and send them to a server\n",
    "2) The server runs a left join in FHE\n",
    "3) One of the client receives the encrypted output data-frame and decrypts it \n",
    "\n",
    "In such a setting, several parties are thus able to merge private databases without ever disclosing any of their sensitive data. Additionally, Concrete ML provides a user-friendly API meant to be as close as possible to Pandas. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "import time\n",
    "from pathlib import Path\n",
    "from tempfile import TemporaryDirectory\n",
    "\n",
    "import numpy\n",
    "import pandas\n",
    "\n",
    "from concrete.ml.pandas import ClientEngine, load_encrypted_dataframe\n",
    "from concrete.ml.pytest.utils import pandas_dataframe_are_equal\n",
    "\n",
    "numpy.random.seed(0)\n",
    "\n",
    "DATA_PATH = Path(\"data/encrypted_pandas\")\n",
    "\n",
    "# pylint: disable=pointless-statement, consider-using-with"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clients"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### User 1\n",
    "\n",
    "On the first user's side, load the private data using Pandas. This example uses the [Tips]( https://www.kaggle.com/code/sanjanabasu/tips-dataset/input) dataset. It was split into two csv files so that: \n",
    "- all columns are different, except for column \"index\", representing the initial data-frame's index\n",
    "- some indexes are common, some others are not"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>client_1</td>\n",
       "      <td>12.54</td>\n",
       "      <td>2.50</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>client_2</td>\n",
       "      <td>11.17</td>\n",
       "      <td>1.50</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>client_3</td>\n",
       "      <td>20.29</td>\n",
       "      <td>2.75</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>client_4</td>\n",
       "      <td>14.07</td>\n",
       "      <td>2.50</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>client_5</td>\n",
       "      <td>15.69</td>\n",
       "      <td>3.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>client_6</td>\n",
       "      <td>18.29</td>\n",
       "      <td>3.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>client_7</td>\n",
       "      <td>16.93</td>\n",
       "      <td>3.07</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>client_8</td>\n",
       "      <td>24.27</td>\n",
       "      <td>2.03</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>client_9</td>\n",
       "      <td>8.77</td>\n",
       "      <td>2.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      index  total_bill   tip     sex smoker\n",
       "0  client_1       12.54  2.50    Male     No\n",
       "1  client_2       11.17  1.50  Female     No\n",
       "2  client_3       20.29  2.75  Female     No\n",
       "3  client_4       14.07  2.50    Male     No\n",
       "4  client_5       15.69  3.00    Male    Yes\n",
       "5  client_6       18.29  3.00    Male     No\n",
       "6  client_7       16.93  3.07  Female     No\n",
       "7  client_8       24.27  2.03    Male    Yes\n",
       "8  client_9        8.77  2.00    Male     No"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CLIENT_1_DIR = DATA_PATH / \"client_1\"\n",
    "\n",
    "df_left = pandas.read_csv(CLIENT_1_DIR / \"df_left.csv\")\n",
    "\n",
    "df_left"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to be encrypted, string values first need to be mapped to integers (see section below about `get_schema`). By default, this mapping is done automatically. However, for example, the column won't be able to be selected when merging encrypted data-frames. This is because such an operator requires the data-frames' string mapping to match, else values will be mixed up.\n",
    "\n",
    "This is exactly the case here, as the index column only contains string values, thus the mapping must be defined by the application developer. This mapping will then be shared to the second client (see below) in order to make sure both matches. Other non-integer columns do not require any pre-computed mapping if they are not expected to be selected for merging. All mappings are grouped per column as a dictionary, called \"schema\". \n",
    "\n",
    "Therefore, let's define our schema:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {\"index\": {index_value: i + 1 for i, index_value in enumerate(df_left[\"index\"].values)}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `ClientEngine` instance is then initialized, which is used for managing keys (encryption, decryption)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "client_1_temp_dir = TemporaryDirectory(dir=str(CLIENT_1_DIR))\n",
    "client_1_temp_path = Path(client_1_temp_dir.name)\n",
    "\n",
    "# Define the directory where to store the serialized keys\n",
    "client_1_keys_path = client_1_temp_path / \"keys\"\n",
    "\n",
    "client_1 = ClientEngine(keys_path=client_1_keys_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `ClientEngine` instance, the user is now able to encrypt the Pandas data-frame, building a new `EncryptedDataFrame` instance. The schema, which includes the string mapping for column `index`, is provided as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_left_enc = client_1.encrypt_from_pandas(df_left, schema=schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`EncryptedDataFrame` objects are able to handle multiple data-types: integers, floating points and strings. Under the hood, the data needs to be quantized under a certain precision before encryption (more info on why: [Key Concepts](../getting-started/concepts.md) and [Quantization](../explanations/inner-workings/quantization_internal.md) ). More specifically:\n",
    "- integers: the values are kept as they are but an error is raised if they are not within the range currently allowed\n",
    "- floating points: the values are quantized under a certain precision, and quantization parameters (scale, zero-point) are sent to the server\n",
    "- strings: the values are mapped to integers using a dict, which is sent to the server as well\n",
    "\n",
    "More generally, the quantized values must be within the range currently allowed. This notably means that the number of rows allowed in a data-frame are also limited, as keys on which to merge are expected to be unique.\n",
    "\n",
    "Once the inputs are quantized and encrypted, the user can print the encrypted data-frame's schema. A schema represents the data-frame's columns as well as their dtype and associated quantization parameters or mappings.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>dtype</th>\n",
       "      <td>object</td>\n",
       "      <td>float64</td>\n",
       "      <td>float64</td>\n",
       "      <td>object</td>\n",
       "      <td>object</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>str_to_int</th>\n",
       "      <td>{'client_1': 1, 'client_2': 2, 'client_3': 3, ...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>{'Male': 1, 'Female': 2}</td>\n",
       "      <td>{'No': 1, 'Yes': 2}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>scale</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.903226</td>\n",
       "      <td>8.917197</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>zero_point</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.92129</td>\n",
       "      <td>12.375796</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                        index total_bill  \\\n",
       "dtype                                                  object    float64   \n",
       "str_to_int  {'client_1': 1, 'client_2': 2, 'client_3': 3, ...        NaN   \n",
       "scale                                                     NaN   0.903226   \n",
       "zero_point                                                NaN    6.92129   \n",
       "\n",
       "                  tip                       sex               smoker  \n",
       "dtype         float64                    object               object  \n",
       "str_to_int        NaN  {'Male': 1, 'Female': 2}  {'No': 1, 'Yes': 2}  \n",
       "scale        8.917197                       NaN                  NaN  \n",
       "zero_point  12.375796                       NaN                  NaN  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_left_enc.get_schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encrypted data-frame can be serialized and saved using the `save` method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_left_enc_path = client_1_temp_path / \"df_left_enc\"\n",
    "df_left_enc.save(df_left_enc_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### User 2\n",
    "\n",
    "The second user's steps are very similar to the first one. It is important to note that both users are expected not to share any of their data-base with each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>client_2</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>client_5</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>client_9</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      index   day    time  size\n",
       "0  client_2  Thur   Lunch     2\n",
       "1  client_5   Sat  Dinner     3\n",
       "2  client_9   Sun  Dinner     2"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CLIENT_2_DIR = DATA_PATH / \"client_2\"\n",
    "\n",
    "df_right = pandas.read_csv(CLIENT_2_DIR / \"df_right.csv\")\n",
    "\n",
    "df_right"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, the users need to share the private keys in order to be able to run an encrypted merge. Future works will provide new techniques that would avoid this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "client_2_temp_dir = TemporaryDirectory(dir=str(CLIENT_2_DIR))\n",
    "client_2_temp_path = Path(client_2_temp_dir.name)\n",
    "\n",
    "# Define the directory where to store the serialized keys\n",
    "client_2_keys_path = client_2_temp_path / \"keys\"\n",
    "\n",
    "# Copy the first user's keys\n",
    "shutil.copy2(client_1_keys_path, client_2_keys_path)\n",
    "\n",
    "client_2 = ClientEngine(keys_path=client_2_keys_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Encrypt the second user's data-frame. Here, the same schema used for client 1 is needed in order to make sure that custom mappings are matching.\n",
    "\n",
    "It is possible to get the encrypted data-frame's representation by simply returning the variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>index</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>..3a5cea3d89..</td>\n",
       "      <td>..5e1c826a69..</td>\n",
       "      <td>..a79208da47..</td>\n",
       "      <td>..4a46bbeabe..</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>..c959c81aa2..</td>\n",
       "      <td>..a992e555a0..</td>\n",
       "      <td>..ad73b4157e..</td>\n",
       "      <td>..947b705b07..</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>..ac944db8d7..</td>\n",
       "      <td>..76d5b0360e..</td>\n",
       "      <td>..c0544afe88..</td>\n",
       "      <td>..fa7f74cea1..</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "         index            day           time           size\n",
       "..3a5cea3d89.. ..5e1c826a69.. ..a79208da47.. ..4a46bbeabe..\n",
       "..c959c81aa2.. ..a992e555a0.. ..ad73b4157e.. ..947b705b07..\n",
       "..ac944db8d7.. ..76d5b0360e.. ..c0544afe88.. ..fa7f74cea1.."
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_right_enc = client_2.encrypt_from_pandas(df_right, schema=schema)\n",
    "\n",
    "df_right_enc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the second user's encrypted data-frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_right_enc_path = client_2_temp_path / \"df_right_enc\"\n",
    "df_right_enc.save(df_right_enc_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Server\n",
    "\n",
    "The server only receives serialized encrypted data-frames. Once it has them, anyone is able to decide which operation to run on which data-frames, but only the parties that encrypted them will be able to decrypt the result.\n",
    "\n",
    "First, the server can deserialize the data-frames using Concrete ML's `load_encrypted_dataframe` function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_left_enc = load_encrypted_dataframe(df_left_enc_path)\n",
    "df_right_enc = load_encrypted_dataframe(df_right_enc_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The server can now run a left join on the encrypted data-frames' common column \"index\" using FHE. This step can take several seconds.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total execution time: 6.84s\n"
     ]
    }
   ],
   "source": [
    "start = time.time()\n",
    "df_joined_enc_server = df_left_enc.merge(df_right_enc, how=\"left\", on=\"index\")\n",
    "end = time.time() - start\n",
    "\n",
    "print(f\"Total execution time: {end:.2f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encrypted output data-frame is then serialized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_joined_enc_server_path = client_1_temp_path / \"df_joined_enc\"\n",
    "\n",
    "df_joined_enc_server.save(df_joined_enc_server_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clients\n",
    "\n",
    "Both user 1 and 2 are able to decrypt the server's encrypted output data-frame, but it first needs to be deserialized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_joined_enc = load_encrypted_dataframe(df_joined_enc_server_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The user can now decrypt it and recover the joined data-frame as a Pandas `DataFrame` object. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>client_1</td>\n",
       "      <td>12.091429</td>\n",
       "      <td>2.509286</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>client_2</td>\n",
       "      <td>10.984286</td>\n",
       "      <td>1.500000</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>client_3</td>\n",
       "      <td>19.841429</td>\n",
       "      <td>2.733571</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>client_4</td>\n",
       "      <td>14.305714</td>\n",
       "      <td>2.509286</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>client_5</td>\n",
       "      <td>15.412857</td>\n",
       "      <td>2.957857</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>client_6</td>\n",
       "      <td>18.734286</td>\n",
       "      <td>2.957857</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>client_7</td>\n",
       "      <td>16.520000</td>\n",
       "      <td>3.070000</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>client_8</td>\n",
       "      <td>24.270000</td>\n",
       "      <td>2.060714</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>client_9</td>\n",
       "      <td>8.770000</td>\n",
       "      <td>1.948571</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      index  total_bill       tip     sex smoker   day    time  size\n",
       "0  client_1   12.091429  2.509286    Male     No   NaN     NaN   NaN\n",
       "1  client_2   10.984286  1.500000  Female     No  Thur   Lunch   2.0\n",
       "2  client_3   19.841429  2.733571  Female     No   NaN     NaN   NaN\n",
       "3  client_4   14.305714  2.509286    Male     No   NaN     NaN   NaN\n",
       "4  client_5   15.412857  2.957857    Male    Yes   Sat  Dinner   3.0\n",
       "5  client_6   18.734286  2.957857    Male     No   NaN     NaN   NaN\n",
       "6  client_7   16.520000  3.070000  Female     No   NaN     NaN   NaN\n",
       "7  client_8   24.270000  2.060714    Male    Yes   NaN     NaN   NaN\n",
       "8  client_9    8.770000  1.948571    Male     No   Sun  Dinner   2.0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_joined_cml = client_1.decrypt_to_pandas(df_joined_enc)\n",
    "\n",
    "df_joined_cml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concrete ML vs Pandas comparison\n",
    "\n",
    "For this demo, expected output from Pandas (in a non-private setting) can be computed and compared to the result above. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>total_bill</th>\n",
       "      <th>tip</th>\n",
       "      <th>sex</th>\n",
       "      <th>smoker</th>\n",
       "      <th>day</th>\n",
       "      <th>time</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>client_1</td>\n",
       "      <td>12.54</td>\n",
       "      <td>2.50</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>client_2</td>\n",
       "      <td>11.17</td>\n",
       "      <td>1.50</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>Thur</td>\n",
       "      <td>Lunch</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>client_3</td>\n",
       "      <td>20.29</td>\n",
       "      <td>2.75</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>client_4</td>\n",
       "      <td>14.07</td>\n",
       "      <td>2.50</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>client_5</td>\n",
       "      <td>15.69</td>\n",
       "      <td>3.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "      <td>Sat</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>client_6</td>\n",
       "      <td>18.29</td>\n",
       "      <td>3.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>client_7</td>\n",
       "      <td>16.93</td>\n",
       "      <td>3.07</td>\n",
       "      <td>Female</td>\n",
       "      <td>No</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>client_8</td>\n",
       "      <td>24.27</td>\n",
       "      <td>2.03</td>\n",
       "      <td>Male</td>\n",
       "      <td>Yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>client_9</td>\n",
       "      <td>8.77</td>\n",
       "      <td>2.00</td>\n",
       "      <td>Male</td>\n",
       "      <td>No</td>\n",
       "      <td>Sun</td>\n",
       "      <td>Dinner</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      index  total_bill   tip     sex smoker   day    time  size\n",
       "0  client_1       12.54  2.50    Male     No   NaN     NaN   NaN\n",
       "1  client_2       11.17  1.50  Female     No  Thur   Lunch   2.0\n",
       "2  client_3       20.29  2.75  Female     No   NaN     NaN   NaN\n",
       "3  client_4       14.07  2.50    Male     No   NaN     NaN   NaN\n",
       "4  client_5       15.69  3.00    Male    Yes   Sat  Dinner   3.0\n",
       "5  client_6       18.29  3.00    Male     No   NaN     NaN   NaN\n",
       "6  client_7       16.93  3.07  Female     No   NaN     NaN   NaN\n",
       "7  client_8       24.27  2.03    Male    Yes   NaN     NaN   NaN\n",
       "8  client_9        8.77  2.00    Male     No   Sun  Dinner   2.0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_joined_pandas = pandas.merge(df_left, df_right, how=\"left\", on=\"index\")\n",
    "\n",
    "df_joined_pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Slight differences cab be observed between Pandas and Concrete ML with floating points values. This is only due to quantization artifacts, as currently only 4 bits of precision are supported. Still, both data-frames are equal under a small float relative tolerance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Concrete ML data-frame is equal to Pandas data-frame: True \n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_are_equal = pandas_dataframe_are_equal(\n",
    "    df_joined_pandas, df_joined_cml, float_rtol=0.1, equal_nan=True\n",
    ")\n",
    "\n",
    "print(\"Concrete ML data-frame is equal to Pandas data-frame:\", df_are_equal, \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clean the temporary directories and their content\n",
    "client_1_temp_dir.cleanup()\n",
    "client_2_temp_dir.cleanup()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "Concrete ML provides a way for multiple parties to run Pandas operations on their data-frames without ever disclosing any sensitive data. This is done through a Pandas-like API that enables users to encrypt the data-frames and a server to run the operations in a private and secure manner using Fully Homomorphic Encryption (FHE). The users are then able to decrypt the output and obtain a result similar to what Pandas would have provided in a non-private setting.  \n",
    "\n",
    "#### Future Work\n",
    "\n",
    "In the near future, bigger precisions will be allowed, which would make encrypted data-frames able to handle larger integers, floating points with better precisions and more unique strings values, as well as provide more rows. Support for more encrypted operations on data-frames will also be added. While users need to share private keys with the current version of the API, threshold decryption, a multi party key generation protocol, could allow them to compute on joint data without revealing it to each other."
   ]
  }
 ],
 "metadata": {
  "execution": {
   "timeout": 10800
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
